Directions

Please document your answers to all homework questions using R Markdown, submitting your compiled output as a zipped .html folder (this is necessary when using plotly and leaflet).


Question #1

Twitter is a popular social media network in which users can send and receive short messages (called “tweets”) on any topic they wish. For this question you will analyze data contained in the file Ghostbusters.txt (which can be read using the code below). This file contains 5000 tweets downloaded from Twitter on July 18, 2016, based on a search for the word “ghostbusters”.

## Because the format of Twitter data is different from what we're used to,
## we'll need the "scan" function to read it into R. "scan" will search for particular
## characters and use them to define each element of the object it returns
data <- scan("https://raw.githubusercontent.com/ds4stats/case-studies/master/twitter-sentiment/Ghostbusters.txt", what = "")

Part A

For Part A, use the stringr package to write code that cleans these data by removing the Unicode values (strings like <U+00A0>). To do this, you should assume that anything appearing between the characters < and >, along with the < and > themselves, can be removed.
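
As a minimal sketch of the idea on made-up strings (the vector `toy` is hypothetical and not taken from Ghostbusters.txt), a pattern that matches "<", then any run of characters other than ">", then ">" can be passed to `str_remove_all()`:

```r
library(stringr)

# Hypothetical toy input, invented for illustration
toy <- c("Loved<U+00A0>it", "so<U+0001F602>funny", "plain")

# "<[^>]*>" matches "<", any characters that are not ">", and the closing ">"
cleaned <- str_remove_all(toy, "<[^>]*>")
cleaned
```

Using `[^>]*` rather than `.*` keeps each match from running past the first closing ">" when a string contains several bracketed values.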

Part B

On Twitter, a user may echo another user’s tweet to share it with their own followers by “retweeting”. In these data, all retweets begin with the letters “RT” followed by “@” and the original user’s Twitter name. For this question, write code that stores the retweets in a separate dataset, then use the length function to find the number of tweets in this dataset. Be sure to anchor the regex used to identify retweets.
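
To see why anchoring matters, here is a small sketch on invented strings (the vector `toy` and the user names are hypothetical). The `^` anchor requires "RT @" to appear at the very start of the string, so a tweet that merely mentions a retweet partway through is not counted:

```r
library(stringr)

# Hypothetical toy tweets, invented for illustration
toy <- c("RT @user1: Ghostbusters was great",
         "I saw an RT @user2 about it",
         "No retweet here")

# "^" anchors the match to the start of each string
is_rt <- str_detect(toy, "^RT @")
retweets <- toy[is_rt]
length(retweets)
```

Without the anchor, the second string would also match, even though it is not a retweet.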

Part C

After excluding retweets, find the number of tweets where “hate” or “hated” (of any capitalization) appear, and the number of tweets where “love”, “loved”, or “looved” (and all variants with more “o”s or other capitalization) appear. Hint: the sum() function counts the number of TRUE elements in a logical vector, so it can be used in conjunction with str_detect() to answer this question. You might also find logical negation, achieved using the ! character, to be helpful in creating a subset of non-retweets. We’ve seen this before in class with the command !is.na(...) being used to select cases without missing values.
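
The hint can be sketched on invented strings as follows (the vector `toy` is hypothetical). Two choices here are assumptions rather than requirements of the question: `regex(..., ignore_case = TRUE)` is one of several ways to handle capitalization, and the `\\b` word boundaries are used so that words such as "gloves" are not counted. You should decide for yourself whether that is the right behavior.

```r
library(stringr)

# Hypothetical toy tweets, invented for illustration
toy <- c("RT @a: I LOVE it", "I looooved it", "Hated every minute",
         "gloves off", "meh")

# "!" negates the logical vector, keeping only the non-retweets
non_rt <- toy[!str_detect(toy, "^RT @")]

# "o+" allows one or more o's; "d?" makes the trailing d optional
n_love <- sum(str_detect(non_rt, regex("\\blo+ved?\\b", ignore_case = TRUE)))
n_hate <- sum(str_detect(non_rt, regex("\\bhated?\\b", ignore_case = TRUE)))
c(love = n_love, hate = n_hate)
```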

Side Remark: To download tweets from Twitter, you need to have a Twitter account and then sign into the developer page. Analyzing Twitter data makes for a potentially interesting project. Details on the authentication procedure can be found at this link: http://thinktostart.com/twitter-authentification-with-r/


Question #2

The Happy Planet Index is an attempt to measure how well different world nations are doing at achieving long, happy, and sustainable lives for their citizens using data compiled from various sources. A description of the dataset’s variables can be found on slide 11 here

For this question, use the plot_ly function in the plotly package to construct a 3-D scatter plot of “LifeExpectancy” and “GDPperCapita” vs “Happiness” with a fitted linear regression plane (found using lm()) depicting the model Happiness ~ LifeExpectancy + GDPperCapita. Your graph should include hoverable labels displaying the country represented by that data-point. You may use the argument hoverinfo = "text" so that these labels only provide the text label you specify (and not the x, y, z coordinates of the point).
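
One possible way to build a regression plane is sketched below on simulated data (the data frame `toy`, its values, and the grid sizes are all made up; only the variable names mirror the question). The idea is to predict Happiness over a grid of the two predictors and reshape the predictions into a matrix whose rows correspond to the y-axis values and whose columns correspond to the x-axis values, which is the layout a plotly surface expects.

```r
library(plotly)

# Simulated stand-in for the real data (hypothetical values)
set.seed(1)
toy <- data.frame(Country = paste("Country", 1:30),
                  LifeExpectancy = runif(30, 50, 85),
                  GDPperCapita = runif(30, 500, 40000))
toy$Happiness <- 2 + 0.05 * toy$LifeExpectancy +
  0.00003 * toy$GDPperCapita + rnorm(30, sd = 0.3)

fit <- lm(Happiness ~ LifeExpectancy + GDPperCapita, data = toy)

# Grid of predictor values spanning the observed ranges
x_seq <- seq(min(toy$LifeExpectancy), max(toy$LifeExpectancy), length.out = 20)
y_seq <- seq(min(toy$GDPperCapita), max(toy$GDPperCapita), length.out = 25)
grid <- expand.grid(LifeExpectancy = x_seq, GDPperCapita = y_seq)

# expand.grid varies LifeExpectancy fastest, so the filled matrix has
# rows = x values; transpose so that rows = y values, columns = x values
z_mat <- t(matrix(predict(fit, newdata = grid), nrow = length(x_seq)))

p <- plot_ly() %>%
  add_trace(data = toy, x = ~LifeExpectancy, y = ~GDPperCapita, z = ~Happiness,
            type = "scatter3d", mode = "markers",
            text = ~Country, hoverinfo = "text") %>%
  add_surface(x = x_seq, y = y_seq, z = z_mat,
              colorscale = "RdBu", showscale = FALSE)
p
```

If the plane appears twisted relative to the points, the orientation of the z matrix is the first thing to check.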

HappyPlanet <- read.csv("https://remiller1450.github.io/data/HappyPlanet.csv")

Your final result should look something like the graphic shown below (it does not need to resemble it exactly, but it should be similar):

Note: colorscale = "RdBu" was used above.


Question #3

A precinct is the smallest geographic unit for which aggregated voting data is publicly available. For this question, you will work with precinct-level election data for the state of Iowa from the 2020 US presidential election. These data were acquired through Harvard’s Dataverse, and were published by the Voting and Election Science Team in 2020 (https://doi.org/10.7910/DVN/K7760H).

To begin, you should download this zipped folder and extract it to an accessible location on your PC. The code below reads these data on my PC (note that I have the iowa_precincts folder in my downloads).

library(leaflet)
library(maptools)
iowa <- readShapeSpatial("C:/Users/millerry/Downloads/iowa_precincts/ia_2020")  ## You will need to change this file path

Part A

In Part A your goal is to create a ggplot map where each precinct is colored according to the margin by which it favored Donald Trump (votes recorded in “G20PRERTRU”) or Joe Biden (votes recorded in “G20PREDBID”). The code below creates a new variable, “MARGIN”, that you can use for this purpose.

## Relative difference in Trump vs. Biden votes
iowa@data$MARGIN = (iowa@data$G20PRERTRU - iowa@data$G20PREDBID)/(
                      iowa@data$G20PRERTRU + iowa@data$G20PREDBID + 
                      iowa@data$G20PRELJOR + iowa@data$G20PREGHAW)
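
To make the scale of “MARGIN” concrete, here is the same formula applied to two made-up precincts (the data frame `toy` and its vote counts are hypothetical). Positive values favor Trump, negative values favor Biden, and the denominator is the sum of the four vote columns used above.

```r
# Hypothetical vote counts for two invented precincts
toy <- data.frame(G20PRERTRU = c(600, 200),
                  G20PREDBID = c(350, 700),
                  G20PRELJOR = c(30, 60),
                  G20PREGHAW = c(20, 40))

# Same relative-difference formula as above
toy$MARGIN <- (toy$G20PRERTRU - toy$G20PREDBID) /
  (toy$G20PRERTRU + toy$G20PREDBID + toy$G20PRELJOR + toy$G20PREGHAW)
toy$MARGIN
```

The first precinct gives (600 - 350) / 1000 = 0.25 and the second gives (200 - 700) / 1000 = -0.5. A precinct with zero votes in all four columns would produce NaN, which is worth keeping in mind when choosing a color scale.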

Your final result should look something like the map below (it does not need to resemble it exactly, but it should be similar):


Part B

In Part B your goal is to make an interactive leaflet map of Poweshiek County (home to Grinnell) and Jasper County (just west of Grinnell) showing each precinct’s name and vote totals for Donald Trump and Joe Biden when you hover over it. The map should also include a highlight that clearly displays which precinct the user is hovering over.

You may use the following code to subset the spatial polygons file to include only Poweshiek and Jasper counties.

## Create subset (note that pow_co is a spatial polygons file, just like "iowa")
pow_co <- iowa[iowa$COUNTY %in% c("Poweshiek", "Jasper"), ]

Your final result should look something like the map below (it doesn’t need to be exactly identical):